ID Mapping

This notebook explores the relationship between 'customers', schools sites, account_uuids and meter_uuids in the prop 39 data.

In general, an account_uuid generally corresponds with an individual school, although there are some exceptions (which generally look to be typos and other data quality issues).

'Customers' may be school districts or other entities, but may also be schools themselves.

There is a strictly heirarchical tree relationship between the following three columns:

customer_name > account_uuid > meter_uuid

Other notes from looking at the PEPS_Data.xlsx definitions table:

LEA (Local Educational Agency) may point to a school but may also point to a district, probably not useful for matching to a meter
"Site CDS Code" = "The unique identifying code based on the CDS code for the school or district facility." If this is available in the xml files, we should grab it from there

Notebook Config



In [1]:

    
db_pass = ""
with open('data_db.cfg','r') as f:
    for line in f:
        s = line.split("=")
        if s[0].strip() == "user":
            db_user = s[1].strip()
        if s[0].strip() == "pass":
            db_pass = ":" + s[1].strip()
        if s[0].strip() == "db":
            db_db = s[1].strip()



In [2]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import mysql.connector, os, datetime
from sqlalchemy import create_engine



In [3]:

    
engine = create_engine('mysql+mysqlconnector://' + db_user + db_pass + '@localhost/' + db_db, echo=False)
conn = engine.connect()

Account table



In [4]:

    
account_df = pd.read_sql('SELECT * FROM account;', con=engine)
print len(account_df)
account_df.head(3)









    



2185






    Out[4]:






  
    
      
      id
      account_uuid
      meter_uuid
      zip5
      customer_account
      customer_city
      customer_name
      customer_zip
      lea_customer
      rate_schedule_id
      school_city
      school_site_name
      school_site_zip
      utility_name
    
  
  
    
      0
      1
      1100176106751
      8496493494
      94544
      None
      HAYWARD
      ALAMEDA COUNTY OFFICE OF EDUCATION
      94544-1136
      None
      E19S
      Hayward
      Alameda County Special Education
      94544-1136
      PacificGasElectric
    
    
      1
      2
      1612340000000
      5637199749
      94560
      None
      NEWARK
      NEWARK UNIFIED SCHOOL DISTRICT
      94560-2554
      None
      HA10S
      Newark
      LEA
      94560-0385
      PacificGasElectric
    
    
      2
      3
      1612340130054
      5637199200
      94560
      None
      NEWARK
      NEWARK UNIFIED SCHOOL DISTRICT
      94560-2554
      None
      A10S
      Newark
      Newark Memorial High
      94560-5007
      PacificGasElectric

General lay of the land



In [5]:

    
df = pd.read_sql('SELECT * FROM account LIMIT 1;', con=engine)
for c in df.columns:
    df[c] = pd.read_sql('SELECT COUNT(DISTINCT(' + c + ')) FROM account;', con=engine)
df









    Out[5]:






  
    
      
      id
      account_uuid
      meter_uuid
      zip5
      customer_account
      customer_city
      customer_name
      customer_zip
      lea_customer
      rate_schedule_id
      school_city
      school_site_name
      school_site_zip
      utility_name
    
  
  
    
      0
      2185
      1078
      2185
      302
      0
      149
      192
      188
      0
      26
      203
      960
      1018
      1



In [6]:

    
query = '''
SELECT COUNT(count) 
FROM (
  SELECT COUNT(*) AS count
  FROM account
  GROUP BY school_city, school_site_name
) AS t1;
'''
pd.read_sql(query, con=engine)









    Out[6]:






  
    
      
      COUNT(count)
    
  
  
    
      0
      1068

Based on the above, we summarize:

there are 2185 rows in the accounts table
there are 2185 distinct meter_uuid values
there are 1078 distinct account_uuid values
there are 1068 distinct school_city + school_site_name pairs
there are 1018 distinct school_site_zips (9-digit zips)
there are 192 distinct customer_names

How does account_uuid relate to distinct school_city + school_name pairs?



In [7]:

    
account_df['site_pair'] = account_df['school_city'] + "_" + account_df['school_site_name']



In [8]:

    
comp = account_df[['site_pair','account_uuid']]
print len(comp)
comp = comp.drop_duplicates()
print len(comp)
comp.head(3)









    



2185
1083






    Out[8]:






  
    
      
      site_pair
      account_uuid
    
  
  
    
      0
      Hayward_Alameda County Special Education
      1100176106751
    
    
      1
      Newark_LEA
      1612340000000
    
    
      2
      Newark_Newark Memorial High
      1612340130054

Note that 1083 is greater than 1078, which is the number of distinct account_uuids: there are thus some account_uuids with multiple site_pair associations. As shown below, these cases look to be caused by discrepencies in site naming texts.



In [9]:

    
account_uuid_doubles = comp[comp.duplicated(subset='account_uuid', keep=False)].sort_values(by='account_uuid')
account_uuid_doubles









    Out[9]:






  
    
      
      site_pair
      account_uuid
    
  
  
    
      684
      FRESNO_Sequoia Middle
      10621666057335
    
    
      1816
      Fresno_Sequoia Middle
      10621666057335
    
    
      849
      Bakersfield_North High School
      15635291534775
    
    
      1858
      Bakersfield_North High
      15635291534775
    
    
      1241
      Arroyo Grande_Arroyo Grande High
      40687594030557
    
    
      1969
      ARROYO GRANDE_ARROYO GRANDE HIGH
      40687594030557
    
    
      1312
      REDWOOD CITY_LEA
      41690620000000
    
    
      2039
      SAN CARLOS_LEA
      41690620000000
    
    
      1349
      SUNNYVALE_CUPERTINO MIDDLE
      43694196046767
    
    
      2048
      Sunnyvale_Cupertino Middle
      43694196046767

There are more cases of duplicated site_pair values. These look to be caused by either typos in the account_uuid values, or by the use of "LEA" in the site_name category.



In [10]:

    
site_pair_doubles = comp[comp.duplicated(subset='site_pair', keep=False)].sort_values(by='site_pair')
site_pair_doubles









    Out[10]:






  
    
      
      site_pair
      account_uuid
    
  
  
    
      839
      Bakersfield_Bakersfield Adult School
      15635291530708
    
    
      1853
      Bakersfield_Bakersfield Adult School
      15635291530609
    
    
      950
      PETALUMA_LEA
      21653420000000
    
    
      1876
      PETALUMA_LEA
      21653750000000
    
    
      1125
      Quincy_LEA
      32669690000000
    
    
      1913
      Quincy_LEA
      32103220000000
    
    
      1615
      Red Bluff_LEA
      52716390000000
    
    
      2160
      Red Bluff_LEA
      52716210000000
    
    
      1032
      SALINAS_LEA
      27661420000000
    
    
      1078
      SALINAS_LEA
      27738250000000
    
    
      1373
      San Jose_LEA
      43694270000000
    
    
      2064
      San Jose_LEA
      43694350000000
    
    
      1592
      VACAVILLE_CAMBRIDGE
      48705656103154
    
    
      1593
      VACAVILLE_CAMBRIDGE
      48705856103154

Hierarchical tree relationship

Testing this hypothesis:

The following three fields display a hierarchical tree relationship: customer_name > account_uuid > meter_uuid

Note that this would mean that (a) each distinct meter_uuid must be associated with just one account_uuid and that (b) each distinct account_uuid must be associated with just one customer_name.

Also note that (a) is true since there is only one instance of each meter_uuid in the table, so we just need to test (b).



In [11]:

    
customer_name_account_df = account_df[['customer_name','account_uuid']]
customer_name_account_df = customer_name_account_df.drop_duplicates()
print len(customer_name_account_df)
customer_name_account_df.head(3)









    



1078






    Out[11]:






  
    
      
      customer_name
      account_uuid
    
  
  
    
      0
      ALAMEDA COUNTY OFFICE OF EDUCATION
      1100176106751
    
    
      1
      NEWARK UNIFIED SCHOOL DISTRICT
      1612340000000
    
    
      2
      NEWARK UNIFIED SCHOOL DISTRICT
      1612340130054

Note from the 'lay of the land' section above that there are 1078 distinct account_uuids in the table, so they are all unique and thus must each be associated with just one customer_name.

So the hierarchical relationship holds.

There are various ways of visualizing the overall mapping - e.g. as a tree or sunburst, below we just consider the distributions of accounts per customer and meters per account.



In [12]:

    
customer_name_groups = customer_name_account_df.groupby('customer_name').count()
customer_name_groups.head()









    Out[12]:






  
    
      
      account_uuid
    
    
      customer_name
      
    
  
  
    
      ACALANES UNION HIGH SCHOOL DISTRICT
      5
    
    
      ALAMEDA COUNTY OFFICE OF EDUCATION
      1
    
    
      ALLENSWORTH ELEMENTARY SCHOOL
      1
    
    
      ALTERNATIVES IN ACTION
      1
    
    
      ALVINA SCHOOL
      1



In [13]:

    
customer_name_groups.hist(bins=50)









    Out[13]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x107e84710>]], dtype=object)



In [14]:

    
account_groups = account_df[['account_uuid','meter_uuid']].groupby('account_uuid').count()
account_groups.head()









    Out[14]:






  
    
      
      meter_uuid
    
    
      account_uuid
      
    
  
  
    
      10619946005730
      2
    
    
      10621170000000
      3
    
    
      10621170106419
      1
    
    
      10621170108142
      1
    
    
      10621170111153
      3



In [15]:

    
account_groups.hist(bins=16)









    Out[15]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x107fccf50>]], dtype=object)

	id	account_uuid	meter_uuid	zip5	customer_account	customer_city	customer_name	customer_zip	lea_customer	rate_schedule_id	school_city	school_site_name	school_site_zip	utility_name
0	1	1100176106751	8496493494	94544	None	HAYWARD	ALAMEDA COUNTY OFFICE OF EDUCATION	94544-1136	None	E19S	Hayward	Alameda County Special Education	94544-1136	PacificGasElectric
1	2	1612340000000	5637199749	94560	None	NEWARK	NEWARK UNIFIED SCHOOL DISTRICT	94560-2554	None	HA10S	Newark	LEA	94560-0385	PacificGasElectric
2	3	1612340130054	5637199200	94560	None	NEWARK	NEWARK UNIFIED SCHOOL DISTRICT	94560-2554	None	A10S	Newark	Newark Memorial High	94560-5007	PacificGasElectric

	site_pair	account_uuid
0	Hayward_Alameda County Special Education	1100176106751
1	Newark_LEA	1612340000000
2	Newark_Newark Memorial High	1612340130054

	site_pair	account_uuid
684	FRESNO_Sequoia Middle	10621666057335
1816	Fresno_Sequoia Middle	10621666057335
849	Bakersfield_North High School	15635291534775
1858	Bakersfield_North High	15635291534775
1241	Arroyo Grande_Arroyo Grande High	40687594030557
1969	ARROYO GRANDE_ARROYO GRANDE HIGH	40687594030557
1312	REDWOOD CITY_LEA	41690620000000
2039	SAN CARLOS_LEA	41690620000000
1349	SUNNYVALE_CUPERTINO MIDDLE	43694196046767
2048	Sunnyvale_Cupertino Middle	43694196046767

	site_pair	account_uuid
839	Bakersfield_Bakersfield Adult School	15635291530708
1853	Bakersfield_Bakersfield Adult School	15635291530609
950	PETALUMA_LEA	21653420000000
1876	PETALUMA_LEA	21653750000000
1125	Quincy_LEA	32669690000000
1913	Quincy_LEA	32103220000000
1615	Red Bluff_LEA	52716390000000
2160	Red Bluff_LEA	52716210000000
1032	SALINAS_LEA	27661420000000
1078	SALINAS_LEA	27738250000000
1373	San Jose_LEA	43694270000000
2064	San Jose_LEA	43694350000000
1592	VACAVILLE_CAMBRIDGE	48705656103154
1593	VACAVILLE_CAMBRIDGE	48705856103154

	account_uuid
customer_name
ACALANES UNION HIGH SCHOOL DISTRICT	5
ALAMEDA COUNTY OFFICE OF EDUCATION	1
ALLENSWORTH ELEMENTARY SCHOOL	1
ALTERNATIVES IN ACTION	1
ALVINA SCHOOL	1

	meter_uuid
account_uuid
10619946005730	2
10621170000000	3
10621170106419	1
10621170108142	1
10621170111153	3